# The Objective
Our data set consists of observations from the Large Hadron Collider at CERN over the year 2012. It is ordered data of the form \(\{(\mathbf{x}_i,y_i,w_i)\}_{i \in D}\) where \(\mathbf{x} \in \mathbb{R}^{30}\), \(y \in \{\mathbf{b},\mathbf{s}\}\) and \(w \in [0,1]\) is a weight which measures the importance of each data point.
Each \(\mathbf{x}_i\) is the collection of observables about an event and each \(y_i\) is the categorization of the event as either ‘Background’ \((= \mathbf{b})\) or ‘Signal’ \((= \mathbf{s})\). We define \[ \mathcal{S} = \{i: y_i = \mathbf{s}\} \text{ and } \mathcal{B} = \{i: y_i =\mathbf{b}\} \] with \(n_s = \mid \mathcal{S} \mid\) and \(n_b = \mid \mathcal{B} \mid\) respectively. The weight variable is not to be used to train the classifier \(f\) in any way; it is only used to compute the AMS, which is our measure of the accuracy of \(f\). We also define \(\hat{f} = \{i: f(\mathbf{x}_i) = \mathbf{s}\}\), i.e. the points labelled as signal by \(f\), and using this we define
\[ s = \sum_{i \in \mathcal{S}\cap \hat{f}}w_i \text{ and } b = \sum_{i \in \mathcal{B}\cap \hat{f}}w_i \]
i.e. \(s\) and \(b\) are the weighted \(\textit{true positives}\) and \(\textit{false positives}\) respectively. Finally, the AMS (approximate median significance) function is defined as follows:
\[ \text{AMS}(f) = \sqrt{2\left(\left(s+b+b_{reg}\right)\ln \left (1+ \frac{s}{b+b_{reg}} \right) -s \right)} \]
where \(b\) and \(s\) are as above and \(b_{reg}\) is a regularizing constant. Notice
that the AMS is computed over the full data set \(D\), so if we wish to compute it over
some training or test set we must re-normalize the weights. We wrote a
function adjust_weights to do this re-normalization, as
well as a function approx_median_sig to calculate the
resulting AMS value.
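A minimal sketch of these two helpers, assuming `approx_median_sig` takes the weighted true and false positives directly and `adjust_weights` rescales a subset's weights so each class's weighted total matches the full data set; the actual implementations in our package may differ, and the default \(b_{reg} = 10\) below is the value used in the Kaggle challenge.

```r
# Sketch of the AMS computation; b_reg = 10 is the challenge's
# regularization constant (an assumption of this sketch).
approx_median_sig <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

# Re-normalize the weights of a subset so that the weighted totals of
# signal and background match those of the full data set.
adjust_weights <- function(w, y, w_full, y_full) {
  for (cls in c("s", "b")) {
    idx <- y == cls
    w[idx] <- w[idx] * sum(w_full[y_full == cls]) / sum(w[idx])
  }
  w
}
```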
Our task is to create some classifier \(f\) which maximizes the AMS.
As we have \(30\)-dimensional feature data we will perform preliminary analysis of the variables before running classification. The following is a loose structure of our method for tackling this problem:
We have a few types of data to consider:

- The variables `KaggleSet` and `KaggleWeight` can be ignored, as they denote which data points were in the provided Kaggle challenge and their relative weights, which is irrelevant to us.
- The discrete classification variable `Label`, which takes values in \(\{\mathbf{b},\mathbf{s}\}\).
- The continuous variable `Weight`, which will be used to compute the AMS and will not be used in classification.
- The discrete variable `PRI_jet_num`, which denotes the number of jets from each event and takes values in \(\{0,1,2,3\}\).
- All other variables are either direct or indirect measurements and are continuous.
Since we are classifying, we want to remove any redundancies in our data; fortunately we have the following co-dependencies:

- `PRI_jet_all_pt = PRI_jet_leading_pt + PRI_jet_subleading_pt`
- `PRI_lep_pt = DER_pt_ratio_lep_tau * PRI_tau_pt`
- Jet-related variables depend on `PRI_jet_num` (see the next section for more).
We can either remove `PRI_jet_all_pt`, or remove both
`PRI_jet_leading_pt` and `PRI_jet_subleading_pt`,
to reduce the dimension by \(1\) or
\(2\). The same idea can be applied
to the other (multiplicative) dependence. With these reductions we could
remove up to \(4\) dimensions from our
data.
If we wish to be more liberal with our removal, we could throw out
all jet-related variables besides `PRI_jet_num`, removing
\(10\) more dimensions from our data.
However, this assumes that the other jet variables have little impact
on classification, which has not yet been validated.
Below are the variables which contain undefined values. By convention,
missing entries were encoded in the data provided to us with the value
`-999`, which is out of range for every observation.
Notice that every column besides `DER_mass_MMC` is a jet
variable, and among those we only see two distinct values of
`completion_rate`. In fact these correspond to different
values of `PRI_jet_num`:
- 0 jets: all jet variables are missing data.
- 1 jet: only `PRI_jet_leading_pt`, `PRI_jet_leading_eta` and `PRI_jet_leading_phi` have data.
- 2 or more jets: all jet variables have data.
The above result can be found in the handbook for the variables
provided with the Kaggle challenge. Unfortunately,
`DER_mass_MMC` has no such explanation; its missing values may be the
result of some event during measurement. However, they are still assigned
the same `-999` value as the other missing data.
Despite understanding the cause of (most of) our missing data, we
still don’t know its impact. The following section is dedicated to
understanding the correlation between the missing data for each
variable and its classification. Take `DER_mass_MMC` for
example: if we find that the missing data is distributed uniformly
across \(y_i \in \{\mathbf{b},\mathbf{s}\}\), then the
appearance of `NA` has no impact on classification; however,
if it is disproportionately skewed towards \(\mathbf{b}\), then we can use this to assist our
classification.
Below we compute the percentage of `NA` data in Background
and Signal respectively. If the missing data is distributed uniformly
across Background and Signal, then these percentages should be very
close.
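A sketch of this comparison, assuming the data sits in a data frame with a `Label` column taking values in \(\{\mathbf{b},\mathbf{s}\}\); the helper name `na_pct_by_class` is hypothetical, not the code we used.

```r
# Percentage of missing (-999-coded) entries per numeric variable,
# split by class label. `df` holds the features plus a Label column.
na_pct_by_class <- function(df, label = df$Label, missing = -999) {
  num_cols <- vapply(df, is.numeric, logical(1))
  sapply(split(df[num_cols], label), function(sub) {
    colMeans(sub == missing) * 100
  })
}
```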
Notice that Signal consistently has \(13\)–\(18\%\) less missing data than
Background. This means that in both the
`DER_mass_MMC` and jet-variable cases, the amount of missing
data is indicative of the classification, so encoding missingness as
`-999` (or some other extreme value) can help with
classification.
To get an understanding of the behavior of each variable given its classification, we plotted their densities. The entire list can be found in the appendix, but here are a few of note:
Note that in the data for these plots the variables have `NA`
instead of `-999` for missing data; again, see the appendix for
plots with `-999` included. Some key takeaways from these
plots are:
- Many of our parameters range over \([-\pi,\pi]\), as they correspond to angles observed in interactions.
- Some of these angle variables, `PRI_tau_phi`, `PRI_lep_phi`, `PRI_met_phi`, `PRI_jet_leading_phi` and `PRI_jet_subleading_phi`, are uniformly distributed, with very little variation between the classifications. Additionally, all except `PRI_jet_subleading_phi` are unaffected when we include the `-999` values. This leads us to consider rejecting them from our classification, reducing the dimension by up to 5.
- The parameters `PRI_tau_eta` and `PRI_lep_eta` have exotic distributions which vary between \(\mathbf{b}\) and \(\mathbf{s}\).
- We have many single-peak distributions, such as `DER_mass_MMC`, `DER_mass_transverse_met_lep` and `DER_pt_ratio_lep_tau`, which show clear variation between the classes.
- The jet-related variables show clear variation between the classes, reinforcing our need to use them and their corresponding `-999` values.
While plots are very pretty, we need a justifiable mathematical result to validate any variable selection. Below we plot the mutual information between each variable \(V\) and the label \(y\), which compares the joint distribution of \((V, y)\) against the product of their marginals.
Below we plot the joint and product distributions for
SOME_VARIABLE as a visual aid to how the mutual information
is computed.
For GOOD_EXAMPLE we see a large
variation between the joint and product distributions, whereas for
BAD_EXAMP we have almost identical distributions.
Below we display the mutual information of all of our variables.
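As a sketch of how such values can be estimated, the following helper discretises a variable into bins and compares the joint distribution with the product of the marginals; the name `mutual_information` and the bin count are illustrative assumptions, not our actual code.

```r
# Estimate the mutual information (in nats) between a continuous
# variable x and the class label y by discretising x into bins.
mutual_information <- function(x, y, n_bins = 20) {
  bins  <- cut(x, breaks = n_bins)
  joint <- table(bins, y) / length(x)              # joint p(v, y)
  prod  <- outer(rowSums(joint), colSums(joint))   # product p(v) p(y)
  keep  <- joint > 0
  sum(joint[keep] * log(joint[keep] / prod[keep]))
}
```

When \(V\) and \(y\) are independent the joint factorises into the product and the estimate is zero; the more the two distributions differ, the larger the value.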
FDA (Fisher discriminant analysis) creates an embedding that tells us how linearly separable our
data can possibly be by maximizing the between-class scatter while
minimizing the within-class scatter. This is a useful aid for deciding
whether a linear classifier is appropriate, and can additionally be
used to see how removing variables affects the separability of our data.
We created two new functions in order to perform FDA:
fisher_discrim, to find the embedding vector that maximises
the ratio of between- to within-class scatter, and
scatter_ratio, to calculate this ratio for a given dataset
and embedding vector. Code for these functions can be found in our R
package higgiesmalls.
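A sketch of what these two functions could look like for the two-class case; the actual implementations in higgiesmalls may differ in interface and detail.

```r
# Two-class Fisher discriminant: the embedding direction is
# w = S_w^{-1} (m_1 - m_2), where S_w is the within-class scatter
# matrix and m_1, m_2 are the class means. X is an n x p matrix.
fisher_discrim <- function(X, y) {
  classes <- unique(y)
  X1 <- X[y == classes[1], , drop = FALSE]
  X2 <- X[y == classes[2], , drop = FALSE]
  S_w <- cov(X1) * (nrow(X1) - 1) + cov(X2) * (nrow(X2) - 1)
  w <- solve(S_w, colMeans(X1) - colMeans(X2))
  w / sqrt(sum(w^2))  # normalise to unit length
}

# Ratio of between- to within-class scatter of the data projected
# onto the direction w (scale-invariant in w).
scatter_ratio <- function(X, y, w) {
  z <- as.vector(X %*% w)
  classes <- unique(y)
  z1 <- z[y == classes[1]]; z2 <- z[y == classes[2]]
  between <- (mean(z1) - mean(z2))^2
  within  <- var(z1) * (length(z1) - 1) + var(z2) * (length(z2) - 1)
  between / within
}
```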
We consider four increasingly aggressive cases of variable elimination: 1) including all variables; 2) removing all 4 algebraic co-dependencies; 3) removing the above and any uniformly distributed variables; 4) removing the above and the 10 variables with the lowest mutual information. The following plots show the results of the FDA embedding for each of these cases.
For these different FDA embeddings, we can compare the ratio of between-class to within-class scatter:
We find that removing variables slightly lowers the linear separability of the data, but in general the values are comparable. Using only the 10 variables with the highest mutual information, however, yields a lower scatter ratio. Based on the plots, we can conclude that it is not possible to achieve a high degree of linear separability between the signal and background events, indicating that a feature transform is most likely appropriate.
We divided our dataset into a training and testing split, training SVMs on the training data and finding the AMS on the testing data.
By training SVMs with a linear kernel vs. a radial kernel on a 20% training set (using all predictor variables), we found that a radial kernel consistently yielded higher AMS:
Based on these results, along with the results of FDA indicating that
the dataset is not linearly separable, we moved forward using a radial
kernel. We performed 5-fold cross-validation over the 20% training set
to tune the two hyperparameters using a grid search: cost and \(\gamma\) (bandwidth). We allowed cost to
take on the values 0.5, 1, and 2 and \(\gamma\) to take on the values 0.01, 0.1,
and 0.5. Tuning was performed using our function
ams_tune_parallel, which modifies the
e1071::tune function to maximise AMS and to conduct
cross-validation in parallel, dramatically improving the performance of
tuning compared to the original package.
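The shape of such a tuner can be sketched generically: score each point of the grid by cross-validated AMS, in parallel across grid points. Everything below (`ams_tune`, the `fit_predict` and `ams` arguments) is an illustrative stand-in, not the actual ams_tune_parallel interface.

```r
library(parallel)

# Generic AMS-maximising grid search with k-fold cross-validation,
# parallelised over the parameter grid. fit_predict(train, test, params)
# must return predicted labels for `test`; ams(pred, test) scores them.
ams_tune <- function(data, grid, fit_predict, ams, k = 5, cores = 1) {
  folds <- split(seq_len(nrow(data)), rep_len(seq_len(k), nrow(data)))
  score_one <- function(i) {
    params <- grid[i, , drop = FALSE]
    mean(vapply(folds, function(idx) {
      pred <- fit_predict(data[-idx, ], data[idx, ], params)
      ams(pred, data[idx, ])
    }, numeric(1)))
  }
  scores <- unlist(mclapply(seq_len(nrow(grid)), score_one, mc.cores = cores))
  list(best.parameters = grid[which.max(scores), , drop = FALSE],
       scores = scores)
}
```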
Below are the results of the hyperparameter tuning using the three sets of variables: 1) Including all variables; 2) Removing the algebraic co-dependencies and uniformly distributed variables; 3) Using only the 10 variables with the highest mutual information.
```r
# SVM using all variables:
svm_radial_tune_all <- readRDS(here("output/svm_radial_tune_20_gc.RDS"))
svm_radial_tune_all$best.parameters
##   cost gamma
## 5    1   0.1
```

```r
# SVM without co-dependencies and uniform variables
svm_radial_tune_drop_codep_unif <- readRDS(here("output/svm_radial_tune_drop_codep_unif_20_gc.RDS"))
svm_radial_tune_drop_codep_unif$best.parameters
##   cost gamma
## 6    2   0.1
```

```r
# SVM using top 10 mutual information
svm_radial_tune_mut_info <- readRDS(here("output/svm_radial_tune_mut_info_20.RDS"))
svm_radial_tune_mut_info$best.parameters
##   cost gamma
## 9    2   0.5
```
We then used the model with the best hyperparameter values from tuning to compute the AMS value on the 80% testing dataset.